[V1][Core] Add a cache hit threshold for requests#24520

Open
kfirwolfson wants to merge 2 commits into vllm-project:main from kfirwolfson:feature/kv-cache-hit-threshold

Conversation

@kfirwolfson kfirwolfson commented Sep 9, 2025

[V1][Core] Add a cache hit threshold for requests

Purpose

Introduce an optional KV-cache hit-rate gating mechanism, discussed in RFC #24256, to skip requests that are unlikely to benefit from prefill in P/D disaggregated deployments.

Edit: an additional scenario where this capability is useful is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests: when a request is preempted on the Decode instance, vLLM today simply evicts the request's KV-cache blocks and later retries the request from scratch. This means the full prefill work is done internally inside the Decode instance, now also covering all of the output tokens generated so far (possibly many). Tests in the field showed that this leads to Decoders executing prefills and eventually locking up. The core problem is that the external router orchestrating P/D (such as llm-d, Dynamo, or Production Stack) has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) rejects this prefill work on preemption, and the request is sent back to the calling router / sidecar / worker.

What this PR adds

  • Global setting: --global-cache-hit-threshold ([0.0–1.0], default 0.0)
  • Per-request override: cache_hit_threshold ([0.0–1.0]) in incoming request ChatCompletionRequest / CompletionRequest (validated in the protocol layer).
  • Finish reason: New enum value and string "cache_threshold" exposed via v1 engine API. Requests rejected by this gating return HTTP 200 with finish_reason="cache_threshold" and no output tokens.
  • Config visibility & hashing: Threshold is included in VllmConfig and SchedulerConfig.
  • Bounds & validation: All threshold values validated to range [0.0, 1.0].
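To illustrate the new finish reason, a gated request returns an HTTP 200 with no output tokens, roughly like the following (response abbreviated and hypothetical; only the empty `text` and the `finish_reason` value are the points of interest):

```json
{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "",
      "logprobs": null,
      "finish_reason": "cache_threshold"
    }
  ]
}
```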

Why

  • Enables Decode-first optimization in P/D disaggregation: when computed-token ratio (local+external) over prompt length is below the threshold, we avoid scheduling low-benefit prefills on decode nodes. This reduces wasted work and remote KV transfers when cache reuse is insufficient.

Backwards compatibility

  • Default is 0.0 → feature is disabled by default. No behavior change unless the threshold is set globally or per request.

Test Plan

1) Unit Tests

Unit tests check the scheduler logic, including:

  • the per-request threshold overriding the global threshold
  • cache hits from the local KV cache, the external KV cache, or both

2) E2E manual tests

Run vllm serve with the --global-cache-hit-threshold 0.8 argument to set a default value. We'll override it in most requests.

vllm serve <model_path> --served-model "Llama-3.1-8B-Instruct" --global-cache-hit-threshold 0.8

Scheduler computes hit_ratio = computed_tokens / prompt_tokens
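A minimal sketch of this gating check (hypothetical names; the real scheduler code differs), including a guard against empty prompts so the division cannot fail:

```python
def should_reject(computed_tokens: int, prompt_tokens: int,
                  threshold: float) -> bool:
    """Reject a request whose cache hit ratio falls below the threshold."""
    if threshold <= 0.0 or prompt_tokens == 0:
        return False  # feature disabled, or nothing to prefill
    hit_ratio = computed_tokens / prompt_tokens
    return hit_ratio < threshold

assert should_reject(16, 58, 0.33)      # 16/58 ~ 0.28 < 0.33 -> rejected
assert not should_reject(16, 40, 0.33)  # 16/40 = 0.40 >= 0.33 -> scheduled
assert not should_reject(0, 100, 0.0)   # default threshold 0.0 disables gating
```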

We will send 4 requests. Note that the order matters: the first request fills the cache the others depend on.

  • Request 1 is sent with cache_hit_threshold: 0, so it is guaranteed to execute and populate the KV-cache. Its short prompt (≈26 tokens) is the prefix of the following requests.
  • Requests 2 and 3 are sent with cache_hit_threshold: 0.33:
    • Request 2: long prompt ≈ 58 tokens → ratio 16/58 ≈ 0.28 → rejected, as the ratio is below the threshold
    • Request 3: medium prompt ≈ 40 tokens → ratio 16/40 = 0.4 → normal generation
  • Request 4 is sent without a cache_hit_threshold field, so the global value of 0.8 takes effect: medium prompt ≈ 39 tokens → ratio 16/39 ≈ 0.41 → rejected, as the ratio is below the global threshold

Request 1) Warm the cache

This run uses cache_hit_threshold: 0 so it’s guaranteed to execute and populate the KV-cache for the base segment.

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size",
    "max_tokens": 20,
    "cache_hit_threshold": 0
  }'

Request 2) MISS case

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size. Then we continue with many words so that the token length will exceed 16*3 and cache hit rate will be too low to pass the test case threshold",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"


Request 3) HIT case

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix but continue with with whatever text tokens we like and keep it medium after all",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: normal generation ("finish_reason" is not "cache_threshold").

Request 4) MISS case using global threshold

Use global threshold set to 0.8

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix and now continue with different text so the hit rate will be too low",
    "max_tokens": 20
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"

Notes

  • Exact token counts can vary slightly by tokenizer/model; we got the numbers above using Llama-3.1-8B-Instruct

Test Result

E2E Local smoke tests on a single node:

  • Below threshold: responses returned 200 with finish_reason: "cache_threshold" and empty outputs.
    • Validated with debug logs
    • Request threshold:
      • Request cmpl-410004b615a54d73b7e9f0deebf2b852-0 rejected: cache hit rate 0.28 < threshold 0.33 (request)
    • Global threshold:
      • Request cmpl-6d66ba796f9247fcadca54ae428bf790-0 rejected: cache hit rate 0.41 < threshold 0.80 (global)
  • At/above threshold: normal token generation.
  • Validators rejected out-of-range values and accepted on boundaries 0.0 and 1.0 (not detailed above)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a cache hit threshold to gate requests, which is a useful optimization for disaggregated deployments. The implementation is mostly solid, covering configuration, API exposure, and the core scheduling logic.

I've identified a critical issue that could lead to a ZeroDivisionError in the scheduler when processing requests with empty prompts. Additionally, there's a code duplication issue in the protocol validation that should be addressed to improve maintainability. My detailed comments provide suggestions for fixing these issues.

@github-actions

github-actions bot commented Sep 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts with a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 3425995 to 7c0485e Compare September 9, 2025 16:31
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0b75346 to 8be6b61 Compare September 14, 2025 05:58
@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat self tag

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 8be6b61 to 0400566 Compare September 30, 2025 10:24
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 4d756b7 to 0c15acc Compare September 30, 2025 12:59
@mergify
Contributor

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 3, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0c15acc to 0c9cb3f Compare October 6, 2025 06:06
@kfirwolfson
Author

kfirwolfson commented Dec 11, 2025

Updated after merge of #26813. Would appreciate help in reviewing @markmc, @njhill, @tlrmchlsmth

@markmc
Member

markmc commented Dec 19, 2025

I'm sorry we have been slow to give feedback on this. And all I have at this stage are some very high-level comments ...

For the decode preemption case - I think it's most natural to look at this first through the same lens as --kv-transfer-config '{"kv_load_failure_policy": "fail"} - a deployment-time choice that prefill should not happen in a decode instance

For the "decode-first P/D" case - this is introducing an alternative KV transfer protocol flow, an additional step before the current prefill-first flow kicks in, and might e.g. lead to additional kv_transfer_params in the prefill request

Both of these make sense, but I'm wary of the specifics of the proposal:

  • A single cache_hit_threshold concept might seem like an elegant solution to both, but to me it's a bit obtuse - it's not obvious (e.g. from the --help output) that this is something that (presumably?) will only interest KV transfer users
  • A threshold becomes yet another tunable knob, yet the way I've described the two features above doesn't immediately suggest that tuning is required
  • A per-request threshold is another leap in complexity and tunability, which feels premature to me

I think I'd be more immediately supportive of doing this in baby-steps, each of which could be a standalone PR:

  1. For the decode preemption case, add --kv-transfer-config '{"kv_decode_only": true}' or similar - I'd go so far as to deprecate the load failure policy config in favour of this one. I'm honestly a bit confused why we're not already using kv_role for this?
  2. For the decode-first mode, add --kv-transfer-config '{"enable_decode_first": true}'
  3. If there's a strong case to be made for having a tunable threshold, that can be added as KV transfer config
  4. Finally, if there's a strong case for per-request tuning of this threshold, that can build upon all of the above

@mergify
Contributor

mergify bot commented Dec 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 19, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from a897a78 to b3123e1 Compare December 21, 2025 08:34
@mergify mergify bot removed the needs-rebase label Dec 21, 2025
@kfirwolfson
Author

Thanks for the detailed discussion points, @markmc, answering inline.

For the decode preemption case - I think it's most natural to look at this first through the same lens as --kv-transfer-config '{"kv_load_failure_policy": "fail"} - a deployment-time choice that prefill should not happen in a decode instance

We think it’s a bit too strict to decide “no prefill at all” in the Decode instance. There are cases in which a little prefill on the Decode node makes sense, as exemplified below.

For the "decode-first P/D" case - this is introducing an alternative KV transfer protocol flow, an additional step before the current prefill-first flow kicks in, and might e.g. lead to additional kv_transfer_params in the prefill request

Not sure what you are referring to in “additional kv_transfer_params in the prefill request”. The “decode-first P/D” flow is an alternative to the current prefill-first flow. If there is enough cache, the Decoder will handle the request and there will be no remote prefill request (and no kv_transfer_params returned from the Prefiller). If the cache hit is low, then the flow continues in a similar way to the current prefill-flow: send the request to the Prefiller and then send it to the Decoder with the kv_transfer_params the Prefiller returned.
Note that in cases of a shared KV Cache storage there are no kv_transfer_params returned, as the Prefiller simply writes the cache directly to the shared storage and the Decoder reads them from the shared storage, as depicted in the diagram in the RFC #24256.

Both of these make sense, but I'm wary of the specifics of the proposal:
• A single cache_hit_threshold concept might seem like an elegant solution to both, but to me it's a bit obtuse - it's not obvious (e.g. from the --help output) that this is something that (presumably?) will only interest KV transfer users
• A threshold becomes yet another tunable knob, yet the way I've described the two features above doesn't immediately suggest that tuning is required

Not sure if you mean “KV Connector” or “KV Transfer”, as they are not exactly interchangeable. The feature can be useful for just KV-offloading using KV-Connectors as well (without direct transfers). One example is described under “Other Scenarios” in the RFC description, where cache is offloaded to local DRAM or disks but not transferred, and there is no P/D-D. A more practical use-case is a PD-D scenario with a shared KV Cache reachable by both Prefillers and Decoders, without direct KV Transfers between the two.
The help text talks about P/D optimizations, as the main use-case. Are you suggesting the “global-cache-hit-threshold” should be moved to the kv_transfer_params?

• A per-request threshold is another leap in complexity and tunability, which feels premature to me
I think I'd be more immediately supportive of doing this in baby-steps, each of which could be a standalone PR:

  1. For the decode preemption case, add --kv-transfer-config '{"kv_decode_only": true}' or similar - I'd go so far as to deprecate the load failure policy config in favour of this one. I'm honestly a bit confused why we're not already using kv_role for this?

As mentioned above and described in the RFC, some prefill work on the Decoder may be acceptable. This is also in “happy paths” and not just for Preemption – some connectors for example move data at the token-block or chunk of blocks resolution. In those cases, the cache will not contain the tail of the request (beyond block/chunk boundaries), which will be calculated using prefill on the Decoder.

  1. For the decode-first mode, add --kv-transfer-config '{"enable_decode_first": true}'

Decode-first is a router/sidecar flow, not something vLLM needs to be aware of. vLLM doesn't really have anything to do with this information, as it just handles incoming requests as they come. The global or per-request threshold allows the router/sidecar to control the flow (the diagram in the RFC shows this). A flag would not suffice; we need a tunable threshold.

  1. If there's a strong case to be made for having a tunable threshold, that can be added as KV transfer config
  2. Finally, if there's a strong case for per-request tuning of this threshold, that can build upon all of the above

We believe the tunable threshold per request gives the highest flexibility. Without it the system is fixed and the threshold has to be determined at loading time. Thinking about it some more, if we want to reduce complexity, we can decide to avoid a global-threshold altogether and use only per-request thresholds.

An example usage for this flow can be seen in the parallel work now being done for adding Decode-first support to llm-d, see feat(lmcache): implement decode first flow on lmcache connector when cache_hit_threshold field is present

If you want, we would love to set up some time to review the suggestion as a whole.

@mergify
Contributor

mergify bot commented Jan 20, 2026

Hi @kfirwolfson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 3d5fba0 to f5f75ef Compare January 20, 2026 08:34
@mergify
Contributor

mergify bot commented Jan 20, 2026

Hi @kfirwolfson, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Contributor

mergify bot commented Jan 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 28, 2026
Kfir Wolfson added 2 commits January 28, 2026 11:29
Fix Gemini CR comments
Add unit tests
Move from SamplingParams to request
unit test remake
fix static code analysis rejects
Fix unit test
fix after local CR
fix pre-commit reject
add threshold to request logger and fix some calls to encode
fix ruff

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
…oject#32726 review

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 641a7c8 to d11d1fa Compare January 28, 2026 09:31
@mergify mergify bot removed the needs-rebase label Jan 28, 2026